input structure
afe434653a898da20044041262b3ac74-AuthorFeedback.pdf
Forexample, thebenchmark from Neuzz19 consists of only a few programs and its size makes it difficult to use our learning based approach that focuses on20 generalization acrossprograms. Wethank thereviewersforpointing outrelevant24 papers, which we will properly cite in our revision. In random environments, these two should38 perform similarly. Reward isgivenafter generating each such input structure.[R1]
Input Matters: Evaluating Input Structure's Impact on LLM Summaries of Sports Play-by-Play
Sundararajan, Barkavi, Sripada, Somayajulu, Reiter, Ehud
A major concern when deploying LLMs in accuracy-critical domains such as sports reporting is that the generated text may not faithfully reflect the input data. We quantify how input structure affects hallucinations and other factual errors in LLM-generated summaries of NBA play-by-play data, across three formats: row-structured, JSON and unstructured. We manually annotated 3,312 factual errors across 180 game summaries produced by two models, Llama-3.1-70B and Qwen2.5-72B. Input structure has a strong effect: JSON input reduces error rates by 69% for Llama and 65% for Qwen compared to unstructured input, while row-structured input reduces errors by 54% for Llama and 51% for Qwen. A two-way repeated measures ANOVA shows that input structure accounts for over 80% of the variance in error rates, with Tukey HSD post hoc tests confirming statistically significant differences between all input formats.
The Hidden Structure -- Improving Legal Document Understanding Through Explicit Text Formatting
Braun, Christian, Lilienbeck, Alexander, Mentjukov, Daniel
Legal contracts possess an inherent, semantically vital structure (e.g., sections, clauses) that is crucial for human comprehension but whose impact on LLM processing remains under-explored. This paper investigates the effects of explicit input text structure and prompt engineering on the performance of GPT-4o and GPT-4.1 on a legal question-answering task using an excerpt of the CUAD. We compare model exact-match accuracy across various input formats: well-structured plain-text (human-generated from CUAD), plain-text cleaned of line breaks, extracted plain-text from Azure OCR, plain-text extracted by GPT-4o Vision, and extracted (and interpreted) Markdown (MD) from GPT-4o Vision. To give an indication of the impact of possible prompt engineering, we assess the impact of shifting task instructions to the system prompt and explicitly informing the model about the structured nature of the input. Our findings reveal that GPT-4o demonstrates considerable robustness to variations in input structure, but lacks in overall performance. Conversely, GPT-4.1's performance is markedly sensitive; poorly structured inputs yield suboptimal results (but identical with GPT-4o), while well-structured formats (original CUAD text, GPT-4o Vision text and GPT-4o MD) improve exact-match accuracy by ~20 percentage points. Optimizing the system prompt to include task details and an advisory about structured input further elevates GPT-4.1's accuracy by an additional ~10-13 percentage points, with Markdown ultimately achieving the highest performance under these conditions (79 percentage points overall exact-match accuracy). This research empirically demonstrates that while newer models exhibit greater resilience, careful input structuring and strategic prompt design remain critical for optimizing the performance of LLMs, and can significantly affect outcomes in high-stakes legal applications.
PDB-Struct: A Comprehensive Benchmark for Structure-based Protein Design
Wang, Chuanrui, Zhong, Bozitao, Zhang, Zuobai, Chaudhary, Narendra, Misra, Sanchit, Tang, Jian
Structure-based protein design has attracted increasing interest, with numerous methods being introduced in recent years. However, a universally accepted method for evaluation has not been established, since the wet-lab validation can be overly time-consuming for the development of new algorithms, and the $\textit{in silico}$ validation with recovery and perplexity metrics is efficient but may not precisely reflect true foldability. To address this gap, we introduce two novel metrics: refoldability-based metric, which leverages high-accuracy protein structure prediction models as a proxy for wet lab experiments, and stability-based metric, which assesses whether models can assign high likelihoods to experimentally stable proteins. We curate datasets from high-quality CATH protein data, high-throughput $\textit{de novo}$ designed proteins, and mega-scale experimental mutagenesis experiments, and in doing so, present the $\textbf{PDB-Struct}$ benchmark that evaluates both recent and previously uncompared protein design methods. Experimental results indicate that ByProt, ProteinMPNN, and ESM-IF perform exceptionally well on our benchmark, while ESM-Design and AF-Design fall short on the refoldability metric. We also show that while some methods exhibit high sequence recovery, they do not perform as well on our new benchmark. Our proposed benchmark paves the way for a fair and comprehensive evaluation of protein design methods in the future. Code is available at https://github.com/WANG-CR/PDB-Struct.
Multi-agent Collective Construction using 3D Decomposition
Srinivasan, Akshaya Kesarimangalam, Singh, Shambhavi, Gutow, Geordan, Choset, Howie, Vundurthy, Bhaskar
This paper addresses a Multi-Agent Collective Construction (MACC) problem that aims to build a three-dimensional structure comprised of cubic blocks. We use cube-shaped robots that can carry one cubic block at a time, and move forward, reverse, left, and right to an adjacent cell of the same height or climb up and down one cube height. To construct structures taller than one cube, the robots must build supporting stairs made of blocks and remove the stairs once the structure is built. Conventional techniques solve for the entire structure at once and quickly become intractable for larger workspaces and complex structures, especially in a multi-agent setting. To this end, we present a decomposition algorithm that computes valid substructures based on intrinsic structural dependencies. We use Mixed Integer Linear Programming (MILP) to solve for each of these substructures and then aggregate the solutions to construct the entire structure. Extensive testing on 200 randomly generated structures shows an order of magnitude improvement in the solution computation time compared to an MILP approach without decomposition. Additionally, compared to Reinforcement Learning (RL) based and heuristics-based approaches drawn from the literature, our solution indicates orders of magnitude improvement in the number of pick-up and drop-off actions required to construct a structure. Furthermore, we leverage the independence between substructures to detect which sub-structures can be built in parallel. With this parallelization technique, we illustrate a further improvement in the number of time steps required to complete building the structure. This work is a step towards applying multi-agent collective construction for real-world structures by significantly reducing solution computation time with a bounded increase in the number of time steps required to build the structure.
Data-Based Design of Multi-Model Inferential Sensors
Mojto, Martin, Lubuลกkรฝ, Karol, Fikar, Miroslav, Paulen, Radoslav
The nonlinear character of industrial processes is usually the main limitation to designing simple linear inferential sensors with sufficient accuracy. In order to increase the inferential sensor predictive performance and yet to maintain its linear structure, multi-model inferential sensors represent a straightforward option. In this contribution, we propose two novel approaches for the design of multi-model inferential sensors aiming to mitigate some drawbacks of the state-of-the-art approaches. For a demonstration of the developed techniques, we design inferential sensors for a Vacuum Gasoil Hydrogenation unit, which is a real-world petrochemical refinery unit. The performance of the multi-model inferential sensor is compared against various single-model inferential sensors and the current (referential) inferential sensor used in the refinery. The results show substantial improvements over the state-of-the-art design techniques for single-/multi-model inferential sensors.
The Impact of Treewidth on Grounding and Solving of Answer Set Programs
Bliem, Bernhard (University of Helsinki, Finland) | Morak, Michael (University of Klagenfurt) | Moldovan, Marius (TU Wien, Vienna, Austria) | Woltran, Stefan (TU Wien, Vienna, Austria)
In this paper, we aim to study how the performance of modern answer set programming (ASP) solvers is influenced by the treewidth of the input program and to investigate the consequences of this relationship. We first perform an experimental evaluation that shows that the solving performance is heavily influenced by treewidth, given ground input programs that are otherwise uniform, both in size and construction. This observation leads to an important question for ASP, namely, how to design encodings such that the treewidth of the resulting ground program remains small. To this end, we study two classes of disjunctive programs, namely guarded and connection-guarded programs. In order to investigate these classes, we formalize the grounding process using MSO transductions. Our main results show that both classes guarantee that the treewidth of the program after grounding only depends on the treewidth (and the maximum degree, in case of connection-guarded programs) of the input instance. In terms of parameterized complexity, our findings yield corresponding FPT results for answer-set existence for bounded treewidth (and also degree, for connection-guarded programs) of the input instance. We further show that bounding treewidth alone leads to NP-hardness in the data complexity for connection-guarded programs, which indicates that the two classes are fundamentally different. Finally, we show that for both classes, the data complexity remains as hard as in the general case of ASP.
Human Not Human
So as I'm getting my feet on the ground, I thought I'd write a binary classification post for mortals. My focus is on the minutia of working with the data, and how to pick a problem that's simple enough to solve from the ground floor. As it happens, I recently installed a fixed position camera to a Raspberry Pi over my porch. I wrote some code to text me when a motion event triggers. And then came a text every 20 minutes with a picture of a cat on it.